Efficient binary embedding of categorical data using BinSketch

نویسندگان

چکیده

In this work, we present a dimensionality reduction algorithm, aka. sketching, for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional vectors, and our distance estimation Cham computes close approximation of the Hamming between any two original vectors only their sketches. The minimum dimension required by to ensure good theoretically depends on sparsity data points—making it useful many real-life scenarios involving sparse We rigorous theoretical analysis approach supplement with extensive experiments several real-world sets, including one over million dimensions. show that duo is significantly fast accurate tasks such as $$\mathrm {RMSE}$$ , all-pair similarity, clustering when compared working full dataset other techniques.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fast Binary Embedding for High-Dimensional Data

Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose two solutions which improve over existing approaches. The first method, Bilinear Binary Embedding (BBE), converts highdimensional ...

متن کامل

Efficient manipulation of binary data using pattern matching

Pattern matching is an important operation in functional programs. So far, pattern matching has been investigated in the context of structured terms. This article presents an approach to extend pattern matching to terms without (much of a) structure such as binaries which is the kind of data format that network applications typically manipulate. After introducing the binary datatype and a notat...

متن کامل

On Binary Embedding using Circulant Matrices

Binary embeddings provide efficient and powerful ways to perform operations on large scale data. However binary embedding typically requires long codes in order to preserve the discriminative power of the input space. Thus binary coding methods traditionally suffer from high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) wh...

متن کامل

Efficient Adaptive Data Compression Using Fano Binary Search Trees

In this paper, we show an effective way of using adaptive self-organizing data structures in enhancing compression schemes. We introduce a new data structure, the Partitioning Binary Search Tree (PBST), which is based on the well-known Binary Search Tree (BST), and when used in conjunction with Fano encoding, the PBST leads to the so-called Fano Binary Search Tree (FBST). The PBST and FBST can ...

متن کامل

Application of the Rasch Model in Categorical Pedigree Analysis Using Mcem: I Binary Data

An extension of the Rasch model with correlated latent variables is proposed to model correlated binary data within families. The latent variables have the classical correlation structure of Fisher (1918) and the model parameters thus have genetic interpretations. The proposed model is fitted to data using a hybrid of the Metropolis-Hastings algorithm and the MCEM modification of the EM-algorit...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: Data Mining and Knowledge Discovery

سال: 2022

ISSN: ['1573-756X', '1384-5810']

DOI: https://doi.org/10.1007/s10618-021-00815-y